AD-A262  276 

r  tll}f  I.  lan  - _ 


I  ifli 


IMENTATION  PAGE 


1  ^UOIlw 


j  1  title  and  subtitle  '■ 


FILTERING,  STATISTICAL  SIGNAL  PROCESSING 
&  VARIATIONAL  PROBLEMS  (U) 

I  s  AUThO’iS,  ■ — - - 


3  AEACRT  Tyat  AS:)  JA’-S  '^vtscr) 

FINAL/ 15  JAN  aq  TO  lA  OCT  92 

n  S- 


Professor  Sanjoy  Hitter 

7  PESfORMiNG  ORGA^l^A'IGN  'i"T“ . T- 

Massachusetts  Institute  of  Technology 
Informatioon  &  Decision  Systems 
Cambridge,  MA  02139 


230^:i/A6 

6U02F 


?ES^C^V  ^G  ORGAM.-a-  gs 
R£OOS’'  NUMBER 


Afote-TR.  9  3 


12a.  DISTRIBUTION  AVAILABIL  .- 


APPROVED  FOR  PUBLIC  RELEASE:  DISTRIBUTION  IS  UNLIMITED 


13.  abstract  CMaximum  200  . 


During  the  grant  period  the  PI  made  major  contributions  in  three  principal  areas: 
(1)  Robust  Kalman  filtering:  <2)  Structure  determination  for  X-ray  crystallography 
and  (3)  Stochastic  recursive  algoritluns  for  global  optimization.  These  theoretical 
advances  have  wide  applications  in  diverse  problems,  such  as  identification  of 
systems  using  maxiraura  likelihood  techniques,  filtering  in  the  presence  of 
non-Gaussian  observation  noise,  outlier  detection,  image  analysis,  and  phase 
estimation  problems. 


07t 


93-06632 


14  SUBJECT  terms 


1S  NUMBER  OF  PAGES 

.a  S' 

16.  PRICE  CODE 


f  of'report'^^'''^^^'^''  r®  P®  j  «^tract 

1 ^CLASSIFIED  UNCLASSIFIED  UNCLASSIFIED  SAR<SAME  AS  REPORT 


'i  u-O-O'  280-5500 


Standard  Form  298  Rev  :  fr 

Ov  4MV  <,t6  JJ'?  i 

m  *02 


A  Final  Scientific  Report  of  Research 
on  Filtering,  Statistical  Signal  Processing, 
and  Variational  Problems 


under  Grant  No.  89-0276 


for  the  period 

15  January  1989  to  14  October  1992 
submitted  to 


United  States  Air  Force 
Office  of  Scientific  Research  (AFOSR) 
Bolling  Air  Force  Base 
Washington,  D.C.  20032 


(Attention:  Dr.  Jon  A.  Sjogren,  Program  Manager, 
Probability,  Statistics,  and  Signal  Processing) 


by  Sanjoy  K.  Mitter 
Principal  Investigator 


.tn 


Laboratory  for  Information  and  Decision  Systems 
Massachusetts  Institute  of  Technology 
Cambridge,  Massachusetts  02139 

February  1993 


Accesion  For 

— 

NT)S  CRA&I  ^ 

DTIC  TAB  |~ 

Unannot.'i'ced  Q 

JllStif  ICotiOT) 

By 

Distribution  / 

AviJilahility  Codes 


!  Av<)i)  J'lO/or 
i  Sutcial 


I'M! 


.  -L. 


1  Introduction 


During  the  grant  period  15  January  1989  to  14  October  1992,  we  have  made  major  contributions 
in  three  principal  areas: 

•  Robust  Kalman  filtering; 

•  Structure  determination  for  X-:ay  crystallography;  and 

•  Stochastic  recursive  algorithms  for  global  optimization. 

These  theoretical  advances  have  wide  applications  in  diverse  problems  such  as  identification  of 
systems  using  maximum  likelihood  techniques,  filtering  in  the  presence  of  non-Gaussian  observation 
noise,  outlier  detection,  image  analysis,  and  ph.ase  estimation  problems. 

A  technical  overview  of  our  research  is  presented  in  Section  2.  This  is  followed  by  lists  of  the 
students,  post-doctor.il  fellows,  and  faculty  that  have  been  supported  by  the  grant,  in  Section  3; 
of  invited  presentations,  in  Section  4;  and  of  publications  based  on  the  work  described  herein,  in 
Section  5. 

2  Description  of  Research 

2.1  Robust  Kalman  Filtering 

2.1.1  Introduction 

Time-dependent  data  are  often  modeled  by  linear  dynamic  systems.  Such  representations  assume 
that  the  data  contain  a  deterministic  component  which  may  be  described  by  a  difference  or  differ¬ 
ential  equation.  Deviations  from  this  component  are  assumed  to  be  random,  and  to  have  certain 
known  distributional  properties.  These  models  may  be  used  to  estimate  the  “true"  values  of  the 
data  uncorrupted  by  measurement  error,  and  possibly  also  to  draw  inference  on  the  source  gener¬ 
ating  the  data. 

Kalman  Filtering  has  found  an  exceptionally  broad  range  of  applications,  not  only  for  estimating 
the  state  of  a  linear  dynamic  system  in  the  presence  of  process  and  observation  noise,  but  also 
for  simultaneously  estimating  model  parameters,  choosing  among  several  competing  models,  and 
detecting  abrupt  changes  in  the  states,  the  parameters,  or  the  form  of  the  model.  It  is  a  remarkably 
versatile  estimator,  originally  derived  via  orthogonal  projections  as  a  generalization  of  the  Wiener 
filter  to  non-stationary  processes,  then  shown  to  be  optimal  in  a  variety  of  settings:  as  the  w^eighted 
least-squares  solution  to  a  regression  problem,  without  regard  to  distributional  assumptions;  as  the 
Bayes  estimator  assuming  Gaussian  noise,  without  regard  to  the  cost  functional;  and  as  the  solution 
to  various  game  theoretic  problems. 

Neverthlesc,  the  Kalman  Filter  breaks  down  catastrophically  in  the  presence  of  heavy-tailed 
noise,  i.e.  outliers.  Even  rare  occurrences  of  unusually  large  observations  severely  degrade  its 
performance,  resulting  in  poor  state  estimates,  non-white  residuals,  and  invalid  inference. 

Statisticians  and  engineers  often  confront  the  problem  of  dealing  with  outliers  in  the  course 
of  mode!  building  and  validation.  Routinely  ignoring  unusual  observations  is  neither  wise,  nor 
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statistically  sound,  since  such  observations  may  contain  valuable  information  as  to  unmodekai 
system  characteristics,  model  degradation  or  breakdown,  measurement  errors,  etc.  But  detecting 
unusual  observations  is  only  possible  by  comparison  with  the  underlying  trends  and  behavior:  yet. 
it  is  precisely  these  that  non-robust  methods  fail  to  capture  when  outliers  are  present.  The  j)urpose 
of  robust  estimators  is  thus  twofold:  to  be  as  nearly  optimal  its  possible  when  there  are  no  outliers, 
i.e.  under  ‘‘normal’’  conditions;  and  to  be  resistent  to  outliers  when  they  do  occur,  i.c.  to  be  able 
to  extract  the  underlying  system  behavior  without  being  unduly  affected  by  spurious  values. 

Past  efforts  to  mitigate  the  effects  of  outliers  on  the  Kalman  Filter  rang^  -id  koc  prac 
tices  such  as  simply  discarding  observations  for  which  residuals  are  "too  large,”  to  more  formal 
approaches  based  on  non-parametric  statistics,  Bayesian  methods,  or  minimax  theory.  An  exten¬ 
sive  survey  of  the  literature  is  in  [34,  35].  Many  of  these  methods  include  heuristic  approximations 
with  ill-understood  characteristics.  While  some  have  been  empirically  found  to  work  well,  their 
theoretical  justifications  have  remained  scanty  at  best.  Their  nonlinear  forms,  coupled  with  t  he 
difficulties  inherent  in  dealing  with  non-normal  distributions,  have  resulted  in  a  strong  preference 
in  the  literature  for  Monte  Carlo  simulations  over  analytical  rigor. 

In  an  effort  to  provide  a  more  rigorous  basis  for  sub-optimal  filtering  in  the  presence  of  non- 
Gaussian  noise,  a  robust  recursive  estimator  has  been  derived  formally,  combining  Huber’s  theory 
of  minimax  robust  estimation  of  a  location  parameter,  recursive  estimators  ba.sed  on  the  stochastic 
approximation  theory  of  Robbins  and  Monro,  and  approximate  conditional  mean  estimation  based 
on  asymptotic  expansion.  An  overview  of  this  approach  appears  in  [32]. 

2.1.2  Preliminaries 

Below,  the  notation  C{x)  denotes  the  probability  law  of  the  random  vector  x  taking  values  in 

S)  denotes  a  multivariate  normal  distribution  with  mean  p  and  covariance  S.  and  A''(x;  p,  S) 
is  the  associated  probability  density  function. 

Consider  first  the  model 

In  =  (2.1) 

where 

Q.n+x  =  FnOn+uu,  (2.2) 

n  =  0, 1,--  -  denotes  discrete  time;  is  the  system  state,  with  a  random  initial  value 

distributed  as  £(2o)  =  A/’C^.Eo);  €  R'’  is  the  observation  (measurement);  €  R^  i.s  the 

process  (plant)  noise  distributed  as  £(w„)  =  Ar(0, Q„);  2;„  e  RP  is  the  observation  (measurement) 
noise  distributed  as  £(ii„)  —  IF,  a.  given  distribution  that  admits  a  density  and  has  mean  and 
variance  given  by  E[n„]  =  0  and  E[£„e][]  =  R;  {Fn},{Hn},{Dn},{Qn},  -o  and  R  are  known 
matrices  or  sequences  of  matrices  with  appropriate  dimensions;  ^  6  R**  is  a  known  vector;  and 
finally  Qj^,  w„,  and  are  mutually  independent  for  all  n. 

The  Kalman  Filter  is  the  estimator  of  the  state  0,^  given  the  observations  Z„  =  {zq,  •••.;„}, 
and  obeys  the  well-known  recursion 

L=L+  (2.3) 

where 

(2.4) 
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is  the  conditional  a  priori  estimate  of  the  state  at  time  n  (i.e..  I)efore  updating  by  the  observation 
~„)  and 


Mn  —  F„_1 

(2.5) 

is  the  conditional  a  priori  estimation  error  covariance  at  time  n. 

^n~  f^nin 

(2.6) 

is  the  innovation  at  time  n  and 

r„  -  +  D„RDl 

(2.7) 

is  its  covariance 

Kn  -  Mni/Jr;' 

(2.8) 

is  the  gain,  and 

Pn  =  Mn  -  /V„r„A J 

(2.9) 

is  the  a  posteriori  estimation  error  covariance  at  time  n  (i.e..  after  updating).  The  initial  condition 
ig  is  given. 

As  is  clear  from  Equations  2.3  and  2.6,  the  estimate  is  a  linear  function  of  the  observation,  a 
characteristic  that  is  optimal  only  in  the  case  of  normally  distributed  noise  or  elliptical  processes. 

which  are  sample-pathwise  mixtures  of  normal  processes.  Similarly,  Equations  2.5  and  2. 8-2. 9 
show  that  the  gain  and  covariance  are  independent  of  the  data,  a  property  related  once  again  to 
the  assumption  of  normality.  Finally,  the  Gaussian  case  T  —  A^(0,  i?),  the  residual  (innovation) 
sequence  is  white  and  is  distributed  as  =  A^{Q, F,). 

When  JF  is  a  heavy-tailed  distribution,  on  the  other  hand,  the  state  estimation  error  can  grow 
without  bound  (since  the  estimate  is  a  linear  function  of  the  observation  noise),  the  residual  se¬ 
quence  becomes  colored,  and  residuals  become  non-normal.  Thus,  not  only  is  the  estimate  poor, 
but  furthermore  invalid  inference  would  result  from  utilizing  the  residual  sequence  in  the  case  of 
significant  excursions  from  normality.  A  robust  estimator  should  at  the  very  least  have  the  follow¬ 
ing  characteristics:  the  state  estimation  error  must  remain  bounded  as  a  single  observation  outlier 
grows  arbitrarily;  the  effect  of  a  single  observation  outlier  must  not  be  spread  out  over  time  by  the 
filter  dynamics,  i.e.  a  single  outlier  in  the  observation  noise  sequence  must  result  in  a  single  outlier 
in  the  residual  sequence;  and  the  residual  sequence  must  remain  nearly  white  when  the  observation 
noise  is  normally  distributed  except  for  an  occasional  outlier. 

Such  behavior  could  be  obtained  by  replacing  Equation  2.3  by,  say. 

In  =  in  +  (2.10) 

where  is  an  influence-bounding  function  that  downweights  “large”  observations.  In  fact,  a 
number  of  robust  filters  in  the  literature  can  be  represented  in  the  form  2.10.  (See  [35].)  The 
significance  of  the  functional  ip  lies  in  the  fact  that  it  processes  the  innovation  so  as  to  mitigate 
the  effects  of  observation  outliers.  “Overprocessing”  the  data  results  in  loss  of  efficiency  at  the 
nominal  model,  while  “underprocessing”  makes  the  estimator  excessively  sensitive  to  outliers,  i.e. 
non-robust.  Some  researchers  have  chosen  ip  functions  on  the  basis  of  engineering  considerations, 
while  others  have  derived  them  on  probabilistic  grounds,  often  using  a  Bayesian  framework.  The 
latter  approach  was  taken  here. 
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2.1.3  The  Conditional  Prior  Distribution 

Suppose  the  observation  noise  distribution  is  a  member  of  the  e-contaminated  normal  class  of 
distributions 

Pe,R  =  {(1  -  f)A^(0,i?)  -\-zH-He  S}  (2.11) 

where  is  the  set  of  all  suitably  regular  probability  distributions,  and  0  <  s  1  is  the  known 
fraction  of  “contamination.”  This  form  of  the  observation  noise  distribution  can  be  used  in  an 
asymptotic  expansion,  in  order  to  obtain  a  first-order  approximation  of  the  conditional  prior  dis¬ 
tribution  p(0„|2n_i)  of  the  state  variable  given  the  observations  Z„_i.  .4  key  prop''rty  that 
ensures  the  finite  dimensionality  of  this  approximation  is  the  exponential  stability  of  the  Kalntu.n 
Filter,  i.c.  the  fact  that  the  effects  of  past  observatiorts  decay  fast  enough.  The  resulting  distri¬ 
bution  is  a  perturbation  from  the  normal,  and  all  the  pertinent  parameters  are  given  by  various 
Kalman  Filters  and  optimal  smoothers  that  each  make  a  specific  assumption  on  the  distribution  of 
the  noise  at  each  point  in  time. 

The  first-order  approximation  of  the  conditional  prior  distribution  p(@„|Z„_i)  is  next  used 
to  obtain  a  first-order  approximation  of  the  conditional  mean  of  the  state  variable  6|„  '^n’en  the 
observations  Z„ — i.e.  to  update  the  estimate  by  the  current  observation  This  step  uses  a 
generalization  of  a  proof  due  to  [28,  29],  made  possible  by  a  change  in  the  order  of  integration.  A 
similar  derivation  also  yields  the  conditional  covariance. 

From  2.11,  and  assuming  for  now  that  H  =  H*  is  known,  one  can  write 

Hn  =  (1  -  nn)v:n  +  VnUn  (2.12) 


where  r]n  is  a  random  variable  independent  of  go  and  {in„}  obeying 


Vn  = 


0  w.p.  (1  -  e) 
1  w.p.  5 


(2.13) 


and  {v'^ }  and  {v^}  are  random  variables  independent  of  go>  and  {i£„}  with  Ciir^)  =  .'’\/’(0,  R) 
(for  some  /?  >  0)  and  C{y^)  —  H* .  Finally,  loosely  defining  a  random  variable  distributed  as  H' 
as  an  “outlier,”  denote  the  event  “there  has  been  no  outlier  among  the  first  n  observations"  by 
R-n  =  {^0  =  0,  •  •  • ,  J?n  =  0})  and  the  event  “there  has  been  exactly  one  outlier  among  the  first  n 
observations,  at  time  i  —  1”  by  =  {770  =  0,  •  •  • ,  r}i^2  =  0,  775-1  =  I,  77,  =  0,  •  ■  •  ,  77„  =  0}.  Then,  it 
is  easy  to  verify  that 


P(g„j^n-l)p(^n-l) 


=  p(?f„_i)p(2:„_iiH„_i)p(g„|Z„_i,'H„-i) 


n 

+  5^p(K-l)p(255-l|K-l)P(^n|2„-l,K-l) 


(2.14) 


t-1 

-f-  higher-order  terms. 

Clearly,  the  first  term  on  the  right-hand  side  of  2.14  is  the  distribution  conditioned  on  the  event 
that  there  were  no  outliers,  each  term  in  the  summation  to  the  event  that  there  was  exactly  one 
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outlier,  and  the  higher-order  terms  to  the  occurrence  of  two  or  more  outliers.  Moreover,  defining 
2},  ==  it  follows  that 

.OIK. 

Note  that  the  only  non-normal  term  on  the  right-hand  side  of  2.15  is  the  last  one.  All  other  terras 
in  2,15,  as  well  as  in  2.14,  are  normal.  These  remarks  are  formalized  in  the  following  theorem. 
Note  first  that  if  the  system  is  completely  observable  and  completely  controllable,  then  given  any 
^  <  oo.  and  defining  the  clused-loop  recursion 

=  (/  -  JVn+li/a+l)F„£„,  (2.16) 

there  exist  A  >  0  and  0  <  <5  <  1  such  that 

(2.17) 

i.e.  the  filter  is  exponentially  asymptotically  stable. 


Theorem  2.1.1  Let  the  system  given  by  Equations  2. 1-2.2  be  stable,  and  let  b  be  a  real  number 
for  which  2.17  holds.  Let  uj  be  the  smallest  integer  such  that 


(5“  <  e 

(2.18) 

If 

we  <  1 

(2.19) 

and  if  the  distribution  H*  has  bounded  moments,  then 

piejZn-i)  = 

{l-erKnKlM{dn.€.M^) 

(2.20) 

i=n—w+l 

(2.21) 

j +  -e), 

(2.22) 

(2.23) 

+ 

(2.24) 

for  all  n>  LJ,  where,  for  z  =  0, 1,  ■ 

■  ■  and  n  >  i, 

(2.25) 

Tn=^n+Kl!r, 

(2.26) 

m;  =  -h  Qn-1 

(2.27) 

7;  = 

(2.28) 

rj.  =  +  DnRDl 

(2.29) 

o 


{2.30) 

-  KirjcJ 

(2.31) 

and 

(2.32) 

for  1  —  1, 2,  •  •  •  and  n  >  i. 

(2.33) 

^'n  -  idi-l  + 

(2.34) 

(2.35) 

subject  to  the  initial  conditions 

(2.36) 

II 

1 

+ 

T 

(2.37) 

(2.38) 

£1  =  ^1 

(2.39) 

0.1 

II 

(2.40) 

—  '^?-l 

(2.41) 

for  i  >  0,  and 

II 

(2.42) 

Mo°  =  Mo 

(2.43) 

«o  =  l- 

(2.44) 

The  normalization  constant  satisfies 

(2.45) 

+e(l  -  E  <  /  -^(^.-1  - 

(2.46) 

1=71— W+l 

H,_,WfHl,)dHm) 

(2.47) 

(The  case  n  <  u)  is  very  similar.) 

Proof.  See  [34,  35]. 

■ 

Note  that  Equations  2.25-2.31  are  a  bank  of  Kalman  Filters,  each  starting  at  a  different  point 
in  time  i  =  0, 1, 2,  ■  •  Because  of  the  way  in  which  they  are  initialized,  the  cases  ?'  >  0  correspond 
to  Kalman  Filters  skipping  the  i  —  1st  observation.  The  case  i  =  0  is  based  on  all  observations. 
Similarly,  Equations  2.33-2.35  are  a  bank  of  optimal  fixed-point  smoothers,  each  estimating  the 
state  at  a  different  point  in  time  z  —  0, 1, 2,  •  -  based  on  all  proceeding  and  subsequent  observations. 
Thus,  each  term  in  the  summation  on  the  right-hand  side  of  2.24  is  a  Kalman  Filter  that  skips  one 
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observation,  coupied  with  an  optimal  smoother  that  estimates  the  state  at  the  time  the  observation 
is  skipped. 

Evidently,  as  n  — ♦  oo,  the  probability  of  the  event  that  only  a  finite  number  of  outliers  occur 
vanishes  for  any  £  >  0.  That  the  density  can  nevertheless  be  approximated  by  the  first-ortler 
expression  in  2.24  is  due  to  the  exponential  asymptotic  stability  of  the  Kalman  Filter:  w'  repre.senfs 
a  “window  size”  beyond  w'hich  the  effects  of  older  observations  have  sufficiently  attenuated. 


2.1.4  The  Conditional  Mean  Estimator 

The  approximate  conditional  prior  distribution  of  Theorem  2.1.1  is  now  used  to  derive 

the  conditional  mean  and  variance,  respectively  denoted  by 

=  E[g„|Z„l  (2.48) 

and 

Sn  =  E[{g„  -  ZMn  ~  ZyiZn].  (2.49) 

Let  h*  denote  the  density  associated  with  H*,  provided  that  it  exists. 


Theorem  2.1.2  Let  the  conditions  of  Theorem  2.1.1  be  satisfied  for  the  system  given  by  Equa¬ 
tions  2. 1-2.2.  If  h*  exists  and  is  bounded  and  differentiable  a.e.,  then 


r,  =  (i-£)“K„+i7r0rU£(i-cr--‘K„4.i  E  <r„  +  Op(a:V) 


i—n-w+1 


for  all  n  >  u>,  where 


7r°  -  (1  -  J Hr^H 


and  the  influence- bounding  functions  are  given  by 


tlio  - 


((1  -  eW{C,0Sl)  +  efm-  e;0,  HnM^H^)h-{i)dl) 

~(1  -  £)7\^(C;0,r«)  +  £  fm  - 


rjo  =  - 


(2.50) 


(2.51) 

(2.52) 

(2.53) 


(2.54) 


(2.55) 


(2.56) 


wit/i  6^,7^,  K"* ,  M* ,  P' ,  r^,  V^,  nJi,  and  k„  as  defined  in  Theorem  2.1.1,  subject  to  the  same 

initial  conditions.  Furthermore, 


E„  =  (1  -  £)“«„+i7r°S«  +  £(1  -  E  +  Op(:jV) 


(2.57) 


t=Tl— W-f  I 
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for  all  n  >  u.'.  where 


Si:  -  .\C  -  -  //„C)//..u«  +  (I,.  -  -C)(T„  -  rii)'' 

S‘  =P'  +  (I.  -ry 


and  ^'5,  IS  given  by 


(The  case  n  <  is  very  similar.) 


KiQ  =  '^.^iO- 


{2.58) 

(2.59) 

(2.60) 


Proof.  See  [34,  35],  ■ 

Both  Theorem  '2.1.1  and  Theorem  2.1.2  are  based  on  the  assumption  that  outliers  occur  rarely 
relative  to  the  dynamics  of  the  filter.  In  the  unlikely  event  that  two  outliers  occur  within  less  than 
u.'  time  steps  of  each  other.  Equation  2.52 — which  shows  that  T[,  is  linear  in  -suggests  that  the 
estimate  would  be  strongly  affected.  This  implies  that  the  estimator  developed  here  is  robust  in 
the  presence  of  rare  and  isolated  outliers,  but  not  when  outliers  occur  in  batches. 

The  estimator  is  a  weighted  sum  of  stochastic  approximation-like  estimators,  with  weights  equal 
to  the  posterior  probabilities  of  each  outlier  configuration.  These  probabilities  are  conditioned  on 
all  the  observations,  including  the  current  one.  Since  the  banks  of  parallel  filters  and  smoothers 
are  entirely  independent  of  each  other,  the  estimate  derived  here  is  well  suited  to  parallel  compu¬ 
tation.  Furthermore,  the  covariance  is  a  function  of  a  set  of  matrices  (Mn),  {P,[},  {FJ,}.  {K!}.  and 
which  are  themselves  independent  of  the  observations.  Thus,  they  can  be  pre-computed  and 
stored,  as  is  sometimes  done  with  the  Kalman  Filter.  Although  the  covariance  given  by  2.57  is  not 
independent  of  the.  data  (a  feature  that  would  only  be  optimal  in  the  normal  case),  this  implies 
that  a  great  deal  of  computation  may  nevertheless  be  performed  off-line. 

Finally,  it  is  easy  to  verify  that,  for  s  =  0, 


vyV(rO;r|:)U,o 

/V(7«;0,r«) 


(2.61) 

(2.62) 


so  that  reduces  to  the  Kalman  Filter  when  the  noise  is  normally  distributed. 


2.2  X-Ray  Crystallography 
2.2.1  Introduction 

A  new  Markov  random  field-based  algorithm  has  been  proposed  for  signal  reconstruction  from 
Fourier  transform  magnitude  motivated  by  the  data  reduction  calculations  of  X-ray  crystallogra¬ 
phy  [12,  5,  11,  14,  8,  6,  7,  9,  13,  15,  10],  The  purpose  of  an  X-ray  crystallography  experiment  is 
to  determine  the  position  in  three  dimensional  space  of  each  atom  in  a  molecule.  The  measured 
data  are  the  magnitudes  squared  of  the  Fourier  transform  of  the  electron  density  function  of  a 
crystal  of  the  molecule  of  interest  and  po.ssibly  also  of  chemical  derivatives.  The  data  reduction 
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calculations  arc  a  signal  rcconstrnction  i)rol)l(‘ni  for  the  tlirec  diiiicnsional  electron  densitv.  in  the 
so-called  ■‘direct"  methods  of  interest  here,  the  reconstruction  is  ha.sed  on  a  uoisv  measurement  of 
the  magnitude  squared  of  the  Fourier  transform  of  the  electron  densitv  of  a  single  crystal,  that  is. 
ito  chemical  derivatives  of  the  molecule  are  sttidied. 

The.se  reconstruction  i)roblems  are  unusual  [30.  21].  For  instance,  it  is  the  [teriodicity  of  the 
crystal  that  samples  the  Fourier  transform  of  the  three  dimeinsional  repeat  tmit  (called  a  •unit 
cell"),  so  that  the  sampling  is  beyond  the  control  of  the  investigator,  and  the  sampling  rate  is 
below  the  .Nyquist  rate  for  the  autocorrelation  function  that  can  be  computed  frotn  the  available 
Fourier  transform  magnitudes.  Furthermore,  the  electron  density  is  invariant  under  a  sjrace  group 
symmetry. 

The  most  powerful  direct  methods  are  probabilistic  in  nature  [20.  1.  2j.  are  based  on  a  model  in 
which  the  atomic  locations  are  independent  random  variables,  and  are  successful  on  small  molecules. 
The  failure  of  these  techniques  to  extend  to  larger  molecules  is  attributed  by  Bricogne  [1.  2j  to  in¬ 
consistent  usage  of  probabilistic  information  and  inaccurate  computation  of  marginal  probabilities. 
In  addition,  he  notes  the  very  idealized  nature  of  the  standard  independent  atomic  location  hy¬ 
pothesis. 

There  are  three  major  themes  in  the  work  reported  here:  tractable  iucor]>oration  of  a  prior) 
information,  consistent  u.se  of  probabilistic  information,  and  analytical  (rather  than  numerical)  ap¬ 
proximations.  The  starting  point  is  a  Markov  random  field  (MRF)  a  prion  model  for  the  electron 
density:  a  Bayesian  statistical  problem  who.se  solution  is  the  thresholded  conditional  mean  of  the 
MRF  given  the  data  is  defined:  and  the  conditional  mean  is  approximately  computed  using  sym¬ 
metry  breaking,  the  spherical  model,  and  small  noise  asymptotics.  Initial  results  from  work  at  MIT 
are  reported  in  [5.  8,  G.  7]  and  further  results  from  work  continued  at  Purdue  University  (School  of 
Electrical  Engineering)  are  described  in  [12.  11,  14.  9.  13.  15,  10). 

2.2.2  The  MRF  a  prion  Model  and  the  a  posteriori  Model 

The  .MRF  defines  a  probability  distribution  on  a  collection  of  binary  random  variables  €  {(),  1} 
which  lie  on  a  lattice.  The  connection  between  the  MRF  and  the  electron  density  is  that  the 
atoms  are  restricted  to  lie  on  the  lattice  and  site  n  is  occupied  by  an  atom  if  and  only  if  0,7  =  1. 
This  construction  assures  a  positive  and  atomic  electron  density.  It  remains  to  arrange  the  correct 
spacing  between  atoms,  which  is  achieved  by  the  Hamiltonian  of  the  .MRF.  is  the 

.sum  of  energies  (07  a.ssociated  with  each  site  in  the  lattice.  The  idea  behind  is  simple:  If  an 
atom  is  not  pre.sent  at  site  n  then  =  0.  If  an  atom  is  present  at  site  ii  then 

1.  if  other  atoms  are  located  within  a  minimum  bond  radius  of  length  ci  then  is  positive 
because  the  atoms  are  unphysically  close  while 

2.  if  no  other  atoms  are  located  within  a  maximum  bond  radius  of  length  r^  then  is  positive 
because  the  atom  at  site  ii  is  floating  free  unbound  in  the  molecule  wdiile 

3.  if  one  or  more  atoms  are  locat.ed  between  the  minimum  and  maximum  bon<l  radii  and  noiu' 
are  located  closer  than  the  minimum  bond  radius  then  it  a  is  negative  because  the  atom  at 
site  n  can  be  correctly  bound. 
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While  a  variety  of  complicated  functional  forms  can  be  chosen  for  a,,-,  good  success  lias  been  ai  ilie^•e<i 
with  (piadratic  forms  which  make  possible  a  wide  range  of  analytic  calculations.  .\ote  that  //■‘t'''"n 
is  invariant  under  translations,  rotations,  and  inversions  of  the  Held  o- 

In  light  of  the  relationship  between  o  and  the  electron  density,  the  exact  oliservations  are  the 
magnitude  squared  of  the  Fourier  coefHcients  of  o.  The  actual  data  /y^r  are  additivel\’  eorrupted  la 
noise  which  is  modeled  as  Gainssian  with  zero  mean  and  known  F-dependent  variance  rri. 

The  joint  and  a  posteriori  distributions  on  o  and  y  can  be  written  as  MRFs  so  the  cah  ulatiou 
of  the  conditional  mean  is  simply  the  calculation  of  the  spatially  varying  mean  of  this  new  .\IRF. 
The  Hamiltonian  for  this  MRF  is  /jot's  where  //”*'*  romes  from  the  Gaussian  conditional 

observation  distribution.  There  is.  however,  a  problem.  Specifically,  the  invariance  of under 
translations,  rotations,  and  inversions  of  the  field  O  and  the  lack  of  phase  measurements  implit's 
that  the  mean  of  the  new  MRF  is  a  constant,  lii  order  to  solve  this  problem  the  Hamiltonian  is 
modified  by  introducing  a  symmetry  breaking  term  M-  which  is  proportional  to  Yin  i-'-nOr,-  This 
is  a  good  choice  becau.se  for  suitable  i.\  called  the  "kerner,  it  breaks  the  symmetries  and  because 
it  is  linear  and  can  thus  be  viewed  as  a  small  perturbation.  The  values  of  the  variables  {(.’rTl  '*re 
set  bv  a  data  adaptive  optimization  described  below. 

2.2.3  Bayesian  Estimation  and  Computation  of  the  Conditional  Mean 

The  cost  that  is  minimized  in  order  to  derive  the  Bayesian  estimator  is  the  mean  scpiared  error 
betw'een  the  true  and  reconstructed  fields.  For  these  binary  fields,  the  ’'segmentation"  cost  that 
applies  an  equal  penalty  to  any  reconstruction  error  leads  to  the  same  estimator.  The  result  of 
the  minimization  problem  is  that  the  estimator  has  two  steps:  first  compute  the  conditional  mean 
of  the  electron  density  o  given  the  data  y  and  then  threshold  the  res\ilt  at  value  1/2  so  that 
sites  with  conditional  mean  greater  (less)  than  1/2  take  value  1  (0).  .As  mentioned  above,  the 
conditional  mean  is  computed  by  computing  the  spatially  varying  mean  of  the  new  MRF  which 
has  Hamiltonian  //^P‘''°''>  +  ^  .  This  calculation  is  done  through  two  approximations; 

First,  the  spherical  model  is  introduced  in  order  to  relax  the  On  €  {0-1}  con.straint.  It  transforms 
a  sum  over  the  corners  of  a  hypercube  into  an  integral  over  the  surface  of  a  hypersphere  inscribed 
around  the  hypercube.  Half  of  the  integrations  can  be  done  analytically  but  the  remaining  half  are 
intractable  exponential-of-quartic  integrations.  Therefore  the  second  approximation  is  made  which 
is  the  evaluation  of  these  integrals  by  small-noise  asymptotic  techniques  where  the  ".small  noise" 
refers  to  small  observation  noise,  i.c..  small  rr|.  (This  is  the  relevant  limit  in  X-ray  crystallography). 
The  key  step  in  the  a.symptotics  is  the  calculation  of  the  critical  point  (i.e..  the  global  minimum 
of  the  exponent),  which  can  be  done  exactly  with  computation  linear  in  the  size  of  the  MRF 
lattice.  The  results  of  these  tw'o  approximations  are  ai’alytic  formulas  for  the  conditional  mean  of 
the  Fourier  coefficients  of  the  field  given  as  functions  of  the  critical  point  and  the  kernel  t'  of  the 
symmetry  breaking. 

2.2.4  Data  Adaptation 

The  kernel  yj  is  chosen  to  minimize  a  cost  function  of  the  conditional  mean  of  the  field  o  given 
the  data  y.  This  optimization  makes  the  estimator  adapt  to  the  data.  The  jjrimary  purpose  of 
the  adaptation  is  to  ameliorate  the  errors  introduced  by  the  spherical  model.  The  cost  penalizes 
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excursions  of  the  mean  outside  of  the  interval  [O.I)  (which  are  exclusively  due  to  the  ai)j)r()ximatiuns 
since  On  €  {0, 1}),  penalizes  excursions  from  the  two  endpoints  0  and  I  (since  one  desires  a  c  tliat 
results  in  a  confident  estimator),  and  penalizes  deviations  of  the  energy  in  v  from  a  target  (.since 
one  does  not  want  i/>  to  vanish  ami  hence  fail  to  break  the  symmetries  or  to  grow  too  large  .so  that 

^  dominates  the  total  Hamiltonian). 

Once  y  is  chosen  the  conditional  mean  of  the  Fourier  coefficients  of  the  field  o  can  be  calculated. 
Then  the  Fourier  series  is  inverted  to  compute  the  conditional  mean  of  the  field  o.  Finallv.  the 
conditional  mean  is  thresholded  at  value  1/2,  that  is,  an  atom  is  jtlaced  at  each  lattice  site  where 
the  conditional  mean  exceeds  1/2.  One  and  two  dimensional  numerical  examjtles  are  given  in  16]. 

2.2.5  Incorporation  of  Space  Group  Symmetries 

The  unit  cell  is  the  periodic  repeat  unit  of  the  crystal.  The  presence  of  a  nontrivial  space  grouj) 
symmetry  means  that  there  is  additional  structure  within  the  unit  cell.  For  example,  the  unit  cell 
might  be  divided  in  half  with  the  electron  density  in  one  half  the  mirror  image  of  the  electron  density 
in  the  other  half.  The  space  group  is  known  before  the  reconstruction  i.s  done.  In  one  dimeiision 
there  are  only  two  space  groups:  the  trivial  group  PI  where  there  is  no  structure  within  the  tinit 
cell  (i.e.,  a  periodic  function)  and  the  group  PI  for  which  there  is  a  mirror  point  of  .symmetry  in 
the  middle  of  the  unit  cell  (i.e.,  periodic  and  even).  In  three  dimensions  the  situation  is  much  more 
complicated  and  there  are  a  total  of  230  space  groups  [21j. 

Three  approaches  to  solving  signal  reconstruction  problems  in  the  presence  of  nontrivial  space 
groups  are  described  {12,  11,  14,  15,  10].  In  Approach  1,  the  actual  space  group  Q  is  replaced  by 
the  subgroup  PI,  the  signal  reconstruction  results  of  [8,  6]  are  applied,  and  then  the  invariance 
under  S  information  is  added  in  two  ways.  First,  reconstructions  that  are  invariant  under  PI  but 
not  ^  are  transformed  into  reconstructions  invariant  under  ^  by  averaging.  Second,  the  invariance 
of  the  signal  under  ^  is  applied  as  a  soft  constraint  by  adding  a  term  to  the  C  cost  function 
for  Ip  optimization.  The  advantage  of  Approach  1  is  simplicity  since  Ref.  [8,  6]  is  applied  with 
little  alteration  to  any  space  group  The  disadvantage  is  the  suboptimal  u.se  of  space  group 
information.  Furthermore,  the  data  adaptation-minimization  of  C  with  respect  to  t’-occurs  in  a 
higher  dimensional  space  than  is  necessary.  Symmetry  breaking  is  retained. 

The  second  and  third  approaches  both  integrate  the  presence  of  the  space  group  ^  as  a  hard 
constraint  into  the  signal  reconstruction  process.  The  two  approaches  differ  by  the  order  in  which 
noncommuting  nonlinear  operations  are  performed:  in  Approach  2  the  spherical  model  is  applied 
before  the  space  group  symmetry  is  enforced  (so  that  the  spherical  model  r  applied  to  the  entire 
unit  cell)  while  in  Approach  3  the  order  is  reversed  (so  that  the  spherical  model  is  applied  only  to 
the  asymmetric  unit).  (The  asymmetric  unit  is  a  minimum  subset  of  the  unit  cell  that  is  sufficient 
to  determine  the  electron  density  in  the  entire  unit  cell).  The  advantage  of  Approach  2  is  that  the 
calculation  of  the  critical  point  in  the  small  observation  noise  asymptotics  is  only  slightly  changed 
from  Refs.  [8][6,  Appendix  A).  Therefore  it  can  be  done  analytically.  The  disadvantage  is  that  the 
spherical  model  approximation  is  applied  over  a  larger  number  of  sites  (the  entire  unit  cell)  and 
so  it  is  less  accurate.  Symmetry  breaking  is  required.  The  advantage  of  Approach  3  is  that  the 
spherical  model  is  applied  over  a  smaller  number  of  sites  (only  the  asymmetric  unit)  and  so  it  is 
more  accurate.  The  disadvantage  is  that  the  calculation  of  the  critical  point  in  the  small  observation 
noise  asymptotics  is  substantially  more  difficult  than  '.n  Refs.  [8][6,  Appendix  A]  and  to  date  an 
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analytical  solution  is  available  onlv  for  a  special  case.  Syniinetrv  breakinj^  i.s  not  ii’quired.  iiurronny 
the  fact  that  symmetry  breaking  is  not  retjnired  in  an  exact  solutioji.  In  fact,  if  used.  syjnnifUrN' 
breaking  only  influences  the  value  of  second  and  higher  ortier  terms  in  tlie  asymptotic  cxjjan.sioii. 
In  both  Approaches  2  and  3  the  data  adaptation  occurs  in  the  smallest  possible  dimensioual  sitace. 

The  methods  are  compared  [12.  11.  14,  15,  10]  in  ID  for  space  group  Pi.  Figure  1  shows  i)erfor- 

mance,  measured  as  EyYini‘?n  -  '?>()“-  for  six  different  estimators  as  a  function  of  tlie  observation 
noise  standard  deviation  a  for  a  L  =  17  sites  lattice.  .All  results  are  Monte  Carlo  computatiotis 
using  1000  realizations.  The  dashed  lines  E  and  A  are  estimators  from  Refs.  [13,  ts,  Oj  which  are 
unaware  of  the  presence  of  Pi  symmetry.  E  is  the  exact  estimator  computed  by  explicitly  sum¬ 
ming  over  all  o  configurations.  Symmetry  breaking  is  present.  This  estimator  is  totally  impractical 
for  any  reasonable  sized  lattice  and  is  the  reason  for  the  choice  of  L  =  17  for  these  simulations. 
However,  it  is  the  optimal  Bayesian  estimator  in  the  absence  of  space  group  information.  .4  is  the 
approximate  estimator  from  Refs.  [13.  8,  6].  The  solid  lines  are  estimators  that  are  aware  of  the 
presence  of  PI  symmetry.  E*  is  the  exact  estimator  computed  by  explicitly  stunming  over  the 
Pi  symmetric  subset  of  6  configurations  (but  with  symmetry  breaking  turned  off).  .41.  .42.  and 
.43  are  the  .Approach  1,  Approach  2,  and  .Approach  3  estimators.  The  critical  point  for  the  small 
observation  noise  asymptotics  for  Approach  3  was  determined  numerically  by  N'20NG  (Ref.  [22. 
Section  8.4,  pp.  903-908]). 

Note  several  aspects  of  these  numerical  results;  Knowledge  that  the  signal  is  Pi  symmetric 
is  very  valuable-compare  E  with  E*\  with  knowledge  that  the  signal  is  Pi  symmetric,  symmetry 
breaking  is  not  required-see  P*;  .Approaches  1  and  2  provide  roughly  equivalent  performance, 
performance  that  sometimes  exceeds  that  of  the  optimal  estimator  E  that  is  unaware  of  the  Pi 
symmetry  (and  is  very  expensive  to  compute);  and  Approach  3  provides  poor  performance  which 
is  attributed  to  the  lack  of  data  adaptation. 

2.2.6  Analytical  Gradients  for  Data  Adaptation  Optimization 

In  the  cited  work,  the  optimization  of  t/'  was  done  using  a  multidimensional  dow’nhill  simplex 
method  [33,  Section  10.4  pp.  305-309],  Evaluating  the  cost  function  requires  two  FFTs.  .A  natural 
improvement  is  to  use  a  conjugate  gradient  algorithm  with  analytical  gradients.  The  fact  that 
the  gradient  can  be  computed  analytically  is  not  surprising  though  the  calculation  requires  care 
because,  for  example,  V'  is  real  so  that  'F  is  conjugate  symmetric.  What  is  surprising  is  that  the 
cost  function  and  its  complete  gradient  can  be  computed  at  a  cost  of  four  FFTs-only  twice  as  much 
computation  as  was  required  for  the  function  value  alone. 

The  algorithm  for  efficient  gradient  calculation  has  been  worked  out  in  3D  for  an  Approach  2 
estimator  for  the  monoclmic  C2  space  group.  (The  equations  are  not  included  here).  Results  using 
a  Fletcher-Reeves-Polak-Ribiere  conjugate  gradient  algorithm  [33,  Section  10.6  pp.  318-322]  on  a 
4x4x4  problem  with  a  2%  observation  noise  standard  deviation  (realistic  for  small  molecule  X-ray- 

crystallography)  are  shown  in  Figure  2.  The  z  axis  is  EyJ2ni4>n  —  and  the  x  and  y  coordinates 
are  two  parameters  in  the  cost  function  C.  Note  both  the  excellent  performance  achieved  and  the 
relative  insensitivity  of  the  performance  to  the  values  of  the  two  parameters.  Extension  of  these 
results  to  experimental  data  is  currently  underway. 
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Figure  2:  Estimator  performance  statistics  on  3D  monoclinic  C2  problems  as  a  function  of  the 
parameters  of  C. 
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2.3  Recursive  Stochastic  Algorithms  for  Global  Optimization  in  /?'' 

2.3.1  Introduction 

.4  class  of  algorithms  for  finding  the  global  minimum  of  a  smooth  function  6'(j  ),x  6  are  termed 
Modified  Stochastic  Gradient  Algorithms.  This  section  analyzes  the  convergence  of  algorithms  of 
the  form 

1  A\.  -  mt(VL-{XO  +  +  hWk,  (2.63) 

where  {^/t}  is  a  sequence  of  valued  random  variables.  {IT*.}  is  a  sequence  of  .sia.idard  d- 
dimensional  independent  Gaussian  random  variables,  and  {nr-},  {6*,}  are  sequences  of  positive 
numbers  with  ajt,  6^  — >  0.  An  algorithm  of  this  type  arises  by  artificially  adding  the  term 

(via  a  Monte  Carlo  simulation)  to  the  standard  stochastic  gradient  algorithm 

Zk+i  -  Zk  -  afc(VT'(Zfc)  +  ^it).  (2.G4) 

.41gorithms  like  2.64  arise  in  a  variety  of  optimization  problems  including  adaptive  filtering, 
identification  and  control;  here  the  sequence  is  due  to  noisy  or  imprecise  meiisurernents  of 
VC/{')  (c.f.  [26]).  The  asymptotic  behavior  of  {Zk}  has  been  much  studied.  Let  5  and  S*  be  the  .set 
of  local  and  global  minima  of  (/(■),  respectively.  It  can  be  shown,  for  example,  that  if  f/(  )  and 
are  suitably  behaved,  aj;  =  A/k  for  k  large,  and  {Zk}  is  bounded,  then  Z*.  — *  S  as  k  —  oo  w.p.l. 
However,  in  general  Zk  -h  S*  (unless  of  course  S  =  S*).  The  idea  behind  adding  the  additional 
bkWk  term  in  2.63  compared  with  2.64  is  that  if  bk  tends  to  zero  slowly  enough,  then  possibly 
(Xjc)  (unlike  {Zk}]  will  avoid  getting  trapped  in  a  strictly  local  minimum  of  L’(  )  (this  is  the  usual 
reasoning  behind  simulated  annealing-type  algorithms).  We  shall  in  fact  show  that  if  U{-)  and 
are  suitably  behaved,  Ok  =  A/k  and  b\  =  B /k\og\ogk  for  k  large  with  B/A  >  Co  (where  Co  is 

a  positive  constant  which  depends  only  on  (/(•)),  and  {X*}  is  tight,  then  Xk  — ♦  S*  as  A:  — ►  oc  in 

probability.  We  also  give  a  condition  for  the  tightness  of  {Xjt}.  We  note  that  the  convergence  of 
Zfc  to  5  can  be  established  under  very  weak  conditions  on  assuming  {Zk}  is  bounded.  Here 
the  convergence  of  X^  to  S*  is  established  under  somewhat  stronger  conditions  on  assuming 
that  {Xfc}  is  tight  (which  is  weaker  than  boundedness). 

2.3.2  Convergence  of  the  Modified  Stochastic  Gradient  Algorithm 

The  analysis  of  the  convergence  of  {X*}  is  usually  based  on  the  asymptotic  behavior  of  the  asso¬ 
ciated  ordinary  differential  equation  (ODE) 

zt=~VUiz{t))  (2.65) 

(c.f.  [26,  27]).  This  motivates  our  analysis  of  the  convergence  of  {Xt}  based  on  the  asymptotic 
behavior  of  the  associated  stochastic  differential  equation  (SDE) 

dY(t)  =  -XUiY{t))dt  -f-  cit)dW{t),  (2.66) 

where  W(-)  is  a  standard  d-dimensional  Wiener  process  and  c(-)  is  a  positive  function  with  c{t)  — *  0 
as  t  >  oo.  This  is  just  the  diffusion  annealing  algorithm  discussed  in  [31,  Equation  (4.3)].  with 
T{t)  =  c^(t)/2.  The  asymptotic  behavior  of  Y{t)  as  t  — +  oo  has  been  studied  intensively  by  a 
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number  of  researchers.  In  [19,  25j.  convergence  results  where  ol)tained  by  considering  a  version 
of  2.66  with  a  reflecting  boundary;  in  [3j.  the  reflecting  i)oundarv  was  removed.  Our  ajialysi.s  of 
{A'fc}  is  based  on  the  analysis  of  Y{t)  developed  in  [3j.  where  the  following  res\ilt  is  proved;  if 
U{-)  is  well-behaved  and  c*(f)  =  C/logt  for  t  large  with  C  >  Cq  (the  same  constant  Cq  as  above) 
then  }'(f)  — *  S*  as  t  oc.  To  see  intuitively  how  {AT}  and  V'(  )  are  related,  let  fc  = 

Uk  =  A/k,bl  =  S/A:  log  log  A;,  c‘(f)  =  C/logf,  and  B/A  —  C.  Note  that  6*.  ~  Then  we 

should  have  that 

T(4+i)  V'(T)  -  (4+1  ~  4)VbTr(4))  +<'{4){nT4+i)  -  H'(4)) 

=  V'(4)  -  a*.VS(F(4))  + 

~  y{h)-ak^U{Y{tk))  +  hkVk 

where  {14}  is  a  sequence  of  standard  d-dimensional  independent  Gaussian  random  variables.  Henci' 
(for  small  enough)  {AT}  and  {F(4)}  should  have  approximately  the  same  distributions.  Of 
course,  this  is  a  heuristic;  there  are  significant  technical  difficulties  in  using  >'(')  fo  analyze  {AT} 
because  we  must  deal  with  long  time  intervals  and  slowly  decreasing  (unbounded)  Gaussian  random 
variables. 

An  algorithm  like  2,63  was  first  proposed  and  analyzed  in  [25],  However,  the  analysis  required 
that  the  trajectories  of  (Affc}  lie  within  a  fixed  ball  (which  as  achieved  by  modifying  2,63  near  the 
boundary  of  the  ball).  Hence  such  a  version  of  2.63  is  only  suitable  for  optimizing  C‘(-)  over  a 
compact  set.  Furthermore  the  analysis  also  required  to  be  zero  in  order  to  obtain  convergence. 
In  our  first  analysis  of  2.63  in  [16],  we  also  required  that  the  trajectories  of  {X*:}  lie  in  a  compact 
set.  However,  our  analysis  did  not  require  to  be  zero,  which  has  important  implications  when 
V[/(‘)  is  not  measured  exactly.  In  our  later  analysis  of  2.63  in  [17],  we  removed  the  requirement 
that  the  trajectories  of  {X*;}  lie  in  a  compact  set.  From  our  point  of  view  this  is  the  most  significant 
difference  between  our  work  in  [17]  and  what  is  done  in  [25,  16]  (and  more  generally  in  other  work  on 
global  optimization  such  as  [4]);  we  deal  with  unbounded  processes  and  establish  the  convergence, 
of  an  algorithm  which  finds  a  global  minimum  of  a  function  when  it  is  not  specified  a  priori  wffiat 
bounded  region  contains  such  a  point. 

We  now  state  the  simplest  result  from  [17]  concerning  the  convergence  of  the  modified  stochastic 
gradient  algorithm  2.63.  We  will  require 

and  the  following  conditions; 

(Al)  U(-)  is  a  function  from  to  [0,  oo)  such  that  the  5*  =  {x  :  b'(x)  <  f-Ty)  V  y}  O.  (We 
also  require  some  mild  regularity  conditions  on  (/(■);  see  2.63). 

(A2)  <  oo- 

(A3)  lirnj^oo  ( |vt/(x)|  ’  }fi)  ~  ^ 

(A4)  For  k  =  0,1,...,  let  fk  be  the  cr-field  generated  by  Xq,  Wq,  . . . ,  There 

exists  an  L  >  0,  a  >  -1,  and  0  >  0  such  that 

<  LaU\Xk\^  +  1),  lE[^,|.Ffc]]  <  Laf ([X,]  +  1)  w.p.  1 
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cind  is  independent  of  ^F^.. 


Theorem  2.3.1  A  ssume  (Al)-(A4)  hold.  Lut  {.Vjt}  he  given  by  2.6.‘i.  Then  there  exists  u  constant 
Co  snch  that  for  Bj A  >  Co 

A\.  — ►  S*  u<i  ^  oc 

in  probability. 


Proof.  See  [17].  ■ 

Remarks: 

1.  The  constant  Co  plays  a  critical  role  in  the  convergence  of  ns  k  oc  and  also  Yl.t)  as 
t  —*  oo.  In  [3],  it  is  shown  that  the  constant  Co  (denoted  there  by  cq)  has  an  interpretation 
in  terms  of  the  action  functional  for  a  family  of  pcrturbetl  dynamical  systems:  see  [3]  for  a 
further  discussion  of  Co  including  some  examples. 

2.  It  is  possible  to  modify  2.63  in  such  a  way  that  only  the  lower  bound  and  not  the  upper  bound 
on  jVf/(-)l  in  (A2)  is  needed  (see  [17]). 

3.  In  [17]  we  actually  separate  the  problem  of  convergence  of  {A"*:}  into  two  parts:  one  to 
establish  tightness  and  another  to  establish  convergence  given  tightness.  This  is  analogous  to 
separating  the  problem  of  convergence  of  {AT/t}  into  two  parts:  one  to  establish  boundedness 
and  another  to  establish  convergence  given  boundedness  (c.f.  [26]).  Now  in  [17]  the  conditions 
given  for  tightness  are  much  stronger  than  the  conditions  given  for  convergence  assuming 
tightness.  For  a  particular  algorithm  it  is  often  possible  to  prove  tightness  directly,  resulting 
in  somewhat  weaker  conditions  than  those  given  in  [31,  Theorem  3.1]. 

2.3.3  Continuous-State  Markov  Chain  Algorithm 

In  this  section  w'e  examine  the  convergence  of  a  class  of  continuous-state  Markov  chain  annealing 
algorithms.  Our  approach  is  to  write  such  an  algorithm  in  the  form  of  a  modified  stochastic  gradient 
algorithm  of  (essentially)  the  type  considered  in  Section  2.3.1.  A  convergence  result  is  obtained 
for  global  optimization  over  all  of  Some  care  is  necessary  to  formulate  a  Markov  chain  with 
appropriate  scaling.  It  turns  out  that  writing  the  Markov  chain  annealing  algorithm  is  (essentially) 
the  form  2.63  is  rather  more  complicated  than  writing  standard  variations  of  gradient  algorithms 
which  use  some  type  of  (possibly  noisy)  finite  difference  estimate  ofVU(-)  in  the  form  2.64  (c.f.  [26]). 
Indeed,  to  the  extend  that  the  Markov  chain  annealing  algorithm  uses  an  estimate  of  VU(-),  it  does 
so  in  a  much  more  subtle  manner  than  a  finite  difference  approximation. 

Although  some  numerical  work  has  been  performed  with  continuous-state  Markov  chain  an¬ 
nealing  algorithm  in  [23,  36],  there  has  been  very  little  theoretical  analysis,  and  furthermore  the 
analysis  of  the  continuous  state  case  does  not  follow  from  the  finite  state  case  in  a  straightforward 
way  (especially  for  an  unbounded  state  space).  The  only  analysis  we  are  aware  of  its  in  [23]  w’here  a 
certain  asymptotic  stability  property  is  established.  Since  our  convergence  results  for  the  continu¬ 
ous  state  Markov  chain  annealing  algorithm  are  ultimately  based  on  the  asymptotic  behavior  of  the 
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diffusion  annealing  algorithm,  our  work  demonstrates  and  exploits  the  clo.se  rcdatiouship  lauwefui 
the  Markov  chain  and  diffusion  versions  of  simulated  annealing. 

We  shall  perform  our  analysis  of  continuous  state  Markov  chain  annealing  algorithms  for  a 
Metropolis  type  chain.  We  remark  that  convergence  results  for  other  continuous-state  Markov  c  iiaiti 
sampling  method-based  annealing  algorithms  (such  as  the  Heat  Bath  method)  can  be  obtamed  bv 
a  similar  procedure.  Recall  that  the  1-step  transition  probability  density  for  a  continuoTis  .state 
Metropolis- type  (fixed  temperature)  Markov  chain  is  given  by 

p{x.y)  =  q{x,y)s(x,  y)  +  rn(x)S(y  -  .r) 


where 

m(x)  =  ^  -  j  (}{x,y)s{x,y)dy 

and 

s{x,y)  =  exp(~[U(jj)  -  Uix)]+/T). 

Here  we  have  dropped  the  subscript  on  the  weighting  factor  s(x,y).  If  we  replace  the  fixed  tem¬ 
perature  T  by  a  temperature  seciuence  {Tk\  we  get  a  Metropolis-type  annealing  algorithm. 

Our  goal  is  to  express  the  Metropolis-type  annealing  algorithm  as  a  modified  stochastic  gradient 
algorithm  like  2.63  so  as  to  establish  its  convergence.  This  leads  us  to  choosing  a  nonstationary 
Gaussian  transition  density 


1 

(27r6|.cr2(x))‘^/2 


exp( 


2bla^ix)’ 


(2.68) 


nix)  = 


2ak 


where  ak{x)  =  (4k()t.t^4  i  0. 

With  these  choices  the  Metropolis-type  annealing  algorithm  can  be  expressed  as 


(2.69) 


A'fc+i  =  Xk  -  akiXUiXk)  +  n)  +  l>k(7{Xk)Wk  (2.70) 


for  appropriately  behaved  Note  that  2.70  is  not  identical  to  2.63  (because  o'(.r)  ^1),  but  is 

turns  out  that  Theorem  2.3.1  holds  for  (X*}  generated  by  either  2.63  or  2.70.  We  remark  that  the 
state  dependent  term  cr(x)  term  in  2.68-  2.69  produces  a  drift  toward  the  origin  proportional  to 
|x|,  w’hich  is  needed  to  establish  tightness  of  the  annealing  chain. 

This  discussion  leads  us  to  the  following  continuous-state  Metropolis- type  annealing  algorithm. 
Let  A^(m,  A)  denote  d-dimensional  normal  measure  with  mean  m  and  covariance  matrix  A. 

Let  {Xk}  be  a  Markov  chain  with  1  step  transition  probability  at  time  k  given  by 

p(X*;+i  € /l|Xit  =  .r)  =  /  Sk(x,y)(LN'{x,blal{x)I){y) +rnkix)\AVT.}  (2.71) 

JA 

where 

mkix)  =  1  -  Sk{x,y)dM{x,blo^{x)I)iy)  (2.72) 

(Tfc(x)  =  (a;^rlx|)yi  (2.73) 
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(2.74) 


2ak{U(y)  -  7-(xji  + 
hi  (Tiix) 


A  convergence  result  similar  to  the  previous  theorem  can  be  proved  for  the  Metropolis  tvpe 
annealing  algorithms  (c.f.  [18]). 
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